Two numerical problems sometimes occur and cause non-convergence. The first concerns outliers of the predicted value for the location parameter. When the estimated location parameter is far smaller than the lower limit or far greater than the upper limit, the definite Gaussian integral in the denominator of the probability density function approaches zero and hence leads to a singularity problem. The second problem concerns the magnitude of the step size in each numerical iteration. In some cases, the step size generated by the inverse Hessian is overdriven by the scale parameter and thus too large to yield admissible parameter estimates. Through a proper adjustment to the Hessian, the step size can be brought under control, and reaching an admissible solution becomes possible.
To illustrate the first problem, we refer to the objective function (negative loglikelihood)
specified in the constrained optimization problem
\begin{equation*}
-\log L=\sum\limits_{i=1}^{n}{\ln {{D}_{i}}}+\frac{1}{2{{\sigma }^{2}}}\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}
-\boldsymbol{{{x}_{i}}\beta} \right)}^{2}}}.
\end{equation*}
When the following two conditions both hold, $\left| b-\boldsymbol{x_{i}^{*}\beta} \right|\gg 5\sigma $ and $\left| a-\boldsymbol{x_{i}^{*}\beta} \right|\gg 5\sigma $, ${{D}_{i}}$ approaches zero, and thus the natural logarithm of ${{D}_{i}}$ approaches negative infinity. The value of the loglikelihood function is therefore dominated by a few large contributions from those outliers. The same issue, on the other hand, does not pose a problem for the second term of the loglikelihood function: the contribution of the squared standardized deviation is comparatively mild, so a few outliers do not seriously distort the result.
To cope with the outlier problem, we can set
\begin{equation*}
{{D}_{i}}=\sqrt{2\pi }\sigma \quad \text{if}\quad \left| a-\boldsymbol{x_{i}^{*}\hat{\beta}}
\right|>5\hat{\sigma }\ \text{ and }\ \left| b-\boldsymbol{x_{i}^{*}\hat{\beta}}
\right|>5\hat{\sigma },
\end{equation*}
where $\Phi \left( \frac{ \boldsymbol{b-x_{i}^{*}\beta}}{\sigma } \right)-\Phi \left( \frac{
\boldsymbol{a-x_{i}^{*}\beta}}{\sigma } \right)$ is thereby set to $1$. Doing so is equivalent to setting the
outlier's probability density value to almost zero
\begin{equation*}
f\left( {{y}_{i}}|\boldsymbol{{{x}_{i}}\beta} ,\sigma \right)=\frac{\exp \left[
-\frac{{{\left( y_{i}-\boldsymbol{x_{i}^{*}\beta} \right)}^{2}}}{2{{\sigma }^{2}}} \right]}
{\sqrt{2\pi }\sigma \left[ \Phi \left( \frac{b-\boldsymbol{x_{i}^{*}\beta}}{\sigma } \right)-\Phi \left( \frac{a-
\boldsymbol{x_{i}^{*}\beta}}{\sigma } \right) \right]}\to 0,
\end{equation*}
and thus outliers do not cause a great disturbance in a particular iteration. Since the occurrence of outliers
is associated with inadmissible beta estimates, the number of outliers falls to zero once later iterations
generate admissible solutions or reach convergence. The main effect of setting ${{D}_{i}}=\sqrt{2\pi }\sigma$ is
to maintain a smooth, convergent sequence of parameter estimates in the numerical optimization.
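As a concrete illustration, the following Python sketch evaluates the guarded negative loglikelihood. The function name \texttt{neg\_log\_likelihood}, the argument names, and the use of NumPy/SciPy are illustrative assumptions rather than the original implementation.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def neg_log_likelihood(beta, sigma, y, x, a, b):
    """Sketch: -log L for the truncated regression model with the outlier
    guard D_i = sqrt(2*pi)*sigma; names and signature are illustrative."""
    mu = x * beta  # predicted location parameter for each observation
    D = np.sqrt(2 * np.pi) * sigma * (norm.cdf((b - mu) / sigma)
                                      - norm.cdf((a - mu) / sigma))
    # When the location estimate lies far outside [a, b], D_i ~ 0 and
    # ln(D_i) -> -inf; replace D_i by sqrt(2*pi)*sigma for those cases.
    outlier = (np.abs(a - mu) > 5 * sigma) & (np.abs(b - mu) > 5 * sigma)
    D = np.where(outlier, np.sqrt(2 * np.pi) * sigma, D)
    return np.sum(np.log(D)) + np.sum((y - mu) ** 2) / (2 * sigma ** 2)
\end{verbatim}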
The second problem concerns the proper control of the step parameter $\boldsymbol{d}$ by
adjusting the Hessian. Without loss of generality, the following discussion assumes a single independent variable
in the truncated regression model, so both $x$ and $\beta$ are treated as scalars. The updating rule of Newton's
method for achieving sequential convergence to the minimum is
\begin{equation*}
\boldsymbol{\gamma }^{\left( k+1 \right)}=\boldsymbol{\gamma }^{\left( k \right)}-{\boldsymbol{H}^{-1}}\boldsymbol{g},
\end{equation*}
where the negative gradient gives the steepest-descent direction in which the parameter estimate should move,
and the inverse Hessian determines how far the move should be, yielding the quadratic rate of convergence.
Let the step parameter ${\boldsymbol{d}^{\left( k \right)}}={{\left( d_{\beta }^{(k)},d_{\sigma }^{(k)} \right)}^{T}}$
denote the change in the parameter estimate from the $k$th to the $(k+1)$th iteration; thus
\begin{equation*}
\left(
\begin{matrix}
d_{\beta }^{(k)} \\
d_{\sigma }^{(k)} \\
\end{matrix}
\right)=
-\left(
\begin{matrix}
\frac{{{\partial }^{2}}\left(-\log L\right)}{\partial {{\beta}^{2}}} & \frac{{{\partial}^{2}}\left(-\log L \right)}
{\partial \beta \partial \sigma} \\
\frac{{{\partial}^{2}}\left(-\log L \right)}{\partial \beta \partial \sigma }&\frac{{{\partial}^{2}}
\left(-\log L \right)}{\partial {{\sigma }^{2}}} \\
\end{matrix}
\right)^{-1}
\left(
\begin{matrix}
\frac{\partial \left( -\log L \right)}{\partial \beta } \\
\frac{\partial \left( -\log L \right)}{\partial \sigma }
\end{matrix}
\right){\Bigr|}_{\boldsymbol{\gamma }^{\left( k \right)}},
\end{equation*}
and specifically,\footnote{The following items specify the gradient and the Hessian for the simplest truncated regression model without the constant. The specification can be easily extended to the multivariate context by adding variable indicators.}
\begin{align*}
\frac{\partial \left( -\log L \right)}{\partial \beta }&=\sum\limits_{i=1}^{n}{\frac{1}{{{D}_{i}}}
\left( \frac{\partial {{D}_{i}}}{\partial \beta } \right)}-\frac{1}{{{\sigma }^{2}}}\sum\limits_{i=1}^{n}
{\left( {{y}_{i}}-\beta x_{i}^{*} \right)x_{i}^{*}} \\
\frac{\partial \left( -\log L \right)}{\partial \sigma }&=\sum\limits_{i=1}^{n}{\frac{1}{{{D}_{i}}}\left( \frac{\partial
{{D}_{i}}}{\partial \sigma } \right)}-\frac{1}{{{\sigma }^{3}}}\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}-
\beta x_{i}^{*} \right)}^{2}}}\\
\frac{{{\partial }^{2}}\left( -\log L \right)}{\partial {{\beta }^{2}}}&=\sum\limits_{i=1}^{n}
{\frac{-1}{D_{i}^{2}}{{\left( \frac{\partial {{D}_{i}}}{\partial \beta } \right)}^{2}}}+\sum\limits_{i=1}^{n}
{\frac{1}{{{D}_{i}}}}\left( \frac{{{\partial }^{2}}{{D}_{i}}}{\partial {{\beta }^{2}}} \right)
+\frac{1}{{{\sigma }^{2}}}\sum\limits_{i=1}^{n}{{{\left( x_{i}^{*} \right)}^{2}}}\\
\frac{{{\partial }^{2}}\left( -\log L \right)}{\partial \beta \partial \sigma }&=\sum\limits_{i=1}^{n}
{\frac{-1}{D_{i}^{2}}\left( \frac{\partial {{D}_{i}}}{\partial \beta } \right)\left(
\frac{\partial {{D}_{i}}}{\partial \sigma } \right)}+\sum\limits_{i=1}^{n}{\frac{1}{{{D}_{i}}}}
\left( \frac{{{\partial }^{2}}{{D}_{i}}}{\partial \beta \partial \sigma } \right)+\frac{2}{{{\sigma }^{3}}}
\sum\limits_{i=1}^{n}{\left( {{y}_{i}}-\beta x_{i}^{*} \right)\left( x_{i}^{*} \right)} \\
\frac{{{\partial }^{2}}\left( -\log L \right)}{\partial {{\sigma }^{2}}}&=\sum\limits_{i=1}^{n}
{\frac{-1}{D_{i}^{2}}{{\left( \frac{\partial {{D}_{i}}}{\partial \sigma } \right)}^{2}}}
+\sum\limits_{i=1}^{n}{\frac{1}{{{D}_{i}}}}\left( \frac{{{\partial }^{2}}{{D}_{i}}}{\partial
{{\sigma }^{2}}} \right)+\frac{3}{{{\sigma }^{4}}}\sum\limits_{i=1}^{n}{{{\left( {{y}_{i}}
-\beta x_{i}^{*} \right)}^{2}}}.
\end{align*}
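To make the use of these expressions concrete, the sketch below assembles the gradient and the Hessian of $-\log L$ for the single-covariate model. The partial derivatives of $D_i$ are approximated by central finite differences rather than derived analytically, and all function and argument names are illustrative assumptions, not the original code.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

def D(beta, sigma, x, a, b):
    """D_i = sqrt(2*pi)*sigma*[Phi((b-x*beta)/sigma) - Phi((a-x*beta)/sigma)]."""
    mu = x * beta
    return np.sqrt(2 * np.pi) * sigma * (norm.cdf((b - mu) / sigma)
                                         - norm.cdf((a - mu) / sigma))

def gradient_and_hessian(beta, sigma, y, x, a, b, h=1e-5):
    """Sketch of the gradient and Hessian of -log L; the derivatives of D_i
    are approximated by central differences (illustrative only)."""
    r = y - beta * x
    Di = D(beta, sigma, x, a, b)
    dD_db = (D(beta + h, sigma, x, a, b) - D(beta - h, sigma, x, a, b)) / (2 * h)
    dD_ds = (D(beta, sigma + h, x, a, b) - D(beta, sigma - h, x, a, b)) / (2 * h)
    d2D_db2 = (D(beta + h, sigma, x, a, b) - 2 * Di
               + D(beta - h, sigma, x, a, b)) / h ** 2
    d2D_ds2 = (D(beta, sigma + h, x, a, b) - 2 * Di
               + D(beta, sigma - h, x, a, b)) / h ** 2
    d2D_dbds = (D(beta + h, sigma + h, x, a, b) - D(beta + h, sigma - h, x, a, b)
                - D(beta - h, sigma + h, x, a, b)
                + D(beta - h, sigma - h, x, a, b)) / (4 * h ** 2)
    g_beta = np.sum(dD_db / Di) - np.sum(r * x) / sigma ** 2
    g_sigma = np.sum(dD_ds / Di) - np.sum(r ** 2) / sigma ** 3
    H_bb = np.sum(-dD_db ** 2 / Di ** 2 + d2D_db2 / Di) + np.sum(x ** 2) / sigma ** 2
    H_bs = (np.sum(-dD_db * dD_ds / Di ** 2 + d2D_dbds / Di)
            + 2 * np.sum(r * x) / sigma ** 3)
    H_ss = np.sum(-dD_ds ** 2 / Di ** 2 + d2D_ds2 / Di) + 3 * np.sum(r ** 2) / sigma ** 4
    return np.array([g_beta, g_sigma]), np.array([[H_bb, H_bs], [H_bs, H_ss]])
\end{verbatim}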
By multiplying the last term of ${{{\partial }^{2}}\left( -\log L \right)}/{\partial
{{\beta }^{2}}}$ by a factor, we can keep the step parameter $d_{\sigma}$ at the same level while
reducing the step parameter $d_{\beta}$. To see why this is the case, we first write
\begin{align*}
&\frac{{{\partial }^{2}}\left( -\log L \right)}{\partial {{\beta }^{2}}}={{t}_{1}}+{{t}_{2}},
\quad \frac{{{\partial }^{2}}\left( -\log L \right)}{\partial \beta \partial \sigma }={{t}_{3}},
\quad \frac{{{\partial }^{2}}\left( -\log L \right)}{\partial {{\sigma }^{2}}}={{t}_{4}} \\
&\frac{\partial \left( -\log L \right)}{\partial \beta }={{g}_{\beta }},\quad \frac{\partial
\left( -\log L \right)}{\partial \sigma }={{g}_{\sigma }},
\end{align*}
where
\begin{equation*}
{{t}_{1}}=\sum\limits_{i=1}^{n}{\frac{-1}{D_{i}^{2}}{{\left( \frac{\partial {{D}_{i}}}{\partial \beta }
\right)}^{2}}}+\sum\limits_{i=1}^{n}{\frac{1}{{{D}_{i}}}}\left( \frac{{{\partial }^{2}}{{D}_{i}}}{\partial
{{\beta }^{2}}} \right), \quad
{{t}_{2}}=\frac{1}{{{\sigma }^{2}}}\sum\limits_{i=1}^{n}{{{\left( x_{i}^{*} \right)}^{2}}}.
\end{equation*}
If we multiply $t_{2}$ by a factor $\tau $, the new Hessian becomes
\begin{equation*}
\boldsymbol{{H}^{*}}=
\left(
\begin{matrix}
{{t}_{1}}+\tau {{t}_{2}} & {{t}_{3}} \\
{{t}_{3}} & {{t}_{4}} \\
\end{matrix}
\right).
\end{equation*}
With a few manipulations, we derive the new step parameter $d_{\beta }^{*}$, which is always smaller than
the original step parameter $d_{\beta }$ whenever $\tau>1$:
\begin{equation*}
d_{\beta}^{*}=-\frac{{{t}_{4}}{{g}_{\beta }}-{{t}_{3}}{{g}_{\sigma }}}{\left( {{t}_{1}}+{{t}_{2}} \right)
{{t}_{4}}-t_{3}^{2}+\left( \tau -1 \right){{t}_{2}}{{t}_{4}}}\le -\frac{{{t}_{4}}{{g}_{\beta }}-{{t}_{3}}
{{g}_{\sigma }}}{\left( {{t}_{1}}+{{t}_{2}} \right){{t}_{4}}-t_{3}^{2}}={{d}_{\beta }},
\end{equation*}
where $\left( \tau -1 \right){{t}_{2}}{{t}_{4}}>0$. Meanwhile, the new step parameter $d_{\sigma }^{*}$
remains close to the original step parameter $d_{\sigma }$, and when $\tau \gg 1$
\begin{equation*}
d_{\sigma }^{*}=-\frac{\left( {{t}_{1}}+{{t}_{2}} \right){{g}_{\sigma }}-{{t}_{3}}{{g}_{\beta }}+\left( \tau -1
\right){{t}_{2}}{{g}_{\sigma }}}{\left( {{t}_{1}}+{{t}_{2}} \right){{t}_{4}}-t_{3}^{2}+\left( \tau -1
\right){{t}_{2}}{{t}_{4}}}\to -\frac{{{g}_{\sigma }}}{{{t}_{4}}}.
\end{equation*}
This indicates that, when $\tau$ is far greater than 1, the update behaves as though the Hessian were
diagonal (the effect of ${{t}_{3}}$ becomes negligible):
\begin{equation*}
\left(
\begin{matrix}
d_{\beta}^{*} \\
d_{\sigma}^{*} \\
\end{matrix}
\right)=
-{\left(
\begin{matrix}
{{t}_{1}}+\tau {{t}_{2}} & 0 \\
0 & {{t}_{4}} \\
\end{matrix}
\right)}^{-1}
\left(
\begin{matrix}
{{g}_{\beta }} \\
{{g}_{\sigma }} \\
\end{matrix}
\right).
\end{equation*}
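A short numerical sketch (with illustrative values and names, not the original code) makes this behaviour visible: it computes the adjusted step $-{(\boldsymbol{H}^{*})}^{-1}\boldsymbol{g}$ and shows that increasing $\tau$ shrinks $d_{\beta}^{*}$ while $d_{\sigma}^{*}$ settles near $-g_{\sigma}/t_{4}$.
\begin{verbatim}
import numpy as np

def adjusted_newton_step(grad, t1, t2, t3, t4, tau=1.0):
    """Newton step with the tau-adjusted Hessian H*: only the term t2 of the
    (beta, beta) element is inflated by tau (illustrative sketch)."""
    H_star = np.array([[t1 + tau * t2, t3],
                       [t3,            t4]])
    return -np.linalg.solve(H_star, grad)  # step parameter d* = -(H*)^{-1} g

# Hypothetical values of t1..t4 and the gradient, chosen only for illustration.
g = np.array([2.0, 1.5])
for tau in (1, 4, 14, 100):
    d_beta, d_sigma = adjusted_newton_step(g, t1=0.5, t2=3.0, t3=0.8, t4=2.5, tau=tau)
    print(tau, d_beta, d_sigma)  # d_beta shrinks; d_sigma approaches -1.5/2.5
\end{verbatim}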
With the above knowledge, we can make the step parameter $d_{\beta}^{*}$ arbitrarily small while keeping the step
parameter $d_{\sigma}^{*}$ at a certain level by increasing the factor $\tau$. This technique of adjusting the
Hessian is important for finding an admissible solution in the numerical analysis. In general, the original
Hessian tends to generate too large a step and cause non-convergence as the number of covariates increases.
The convergence rate is slower when a smaller step $d_{\beta}^{*}$ is used. Most importantly, the final
parameter estimates do not differ much if $\tau$ is chosen within a limited range. In the replication
studies, we set $\tau$ to $14$ and $4$ for Models I and II, respectively.